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Abstract — This primer explains how continuous-time stochastic 
processes (precisely, Brownian motion and other Ito diffusions) 
can be defined and studied on manifolds. No knowledge is 
assumed of either differential geometry or continuous-time pro- 
cesses. The arguably dry approach is avoided of first introducing 
differential geometry and only then introducing stochastic pro- 
cesses; both areas are motivated and developed jointly. 

Index Terms — Differential geometry, stochastic differential 
equations on manifolds, estimation theory on manifolds, 
continuous-time stochastic processes, ltd diffusions, Brownian 
motion, Lie groups. 

I. Introduction 

THE tools of calculus — differentiation, integration, Tay- 
lor series, the chain rule and so forth — have extensions 
to curved surfaces and more abstract manifolds, and a different 
set of extensions to stochastic processes. Stochastic differential 
geometry brings together these two extensions. 

This primer was written from the perspective that, to be 
useful, it should give more than a big-picture view by drilling 
down to shed light on important concepts otherwise obfuscated 
in highly technical language elsewhere. Gaining intuition, and 
gaining the ability to calculate, while going hand in hand, are 
distinct from each other As there is ample material catering 
for the latter |1 1-|4|, the focus is on the former. 

Brownian motion plays an important role in both theoretical 
and practical aspects of signal processing. Section[ll]is devoted 
to understanding how Brownian motion can be defined on a 
Riemannian manifold. The standpoint is that it is infinitely 
more useful to know how to simulate Brownian motion than 
to learn that the generator of Brownian motion on a manifold 
is the Laplace-Beltrami operator 

Stochastic development is introduced early on, in Sec- 
because "rolling without slipping'' 



Section VI examines more closely the theory of stochastic 



tion 



II-B 



is a simple yet 

striking visual aid for understanding curvature, the key feature 
making manifolds more complicated and more interesting 
than Euclidean space. Section III explains how stochastic 
development can be used to extend the concept of state-space 
models from Euclidean space to manifolds. This motivates the 
introduction of stochastic differential equations in Section IV 
Since textbooks abound on stochastic differential equations 
in Euclidean space, only a handful of pertinent facts are pre- 
sented. As explained in SectionjV] and despite appearances |2|, 
||5), ||6j, going from stochastic differential equations in Eu- 
clidean space to stochastic differential equations on manifolds 
is relatively straightforward conceptually if not technically. 
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integration. It explains (perhaps in a novel way) how ran- 
domness can make it easier for integrals to exist. It clarifies 
seemingly contradictory statements in the literature about path- 
wise integration. Finally, it emphasises that despite the internal 
complexities, stochastic integrals are constructed in the same 
basic way as other integrals and, from that perspective, are no 
more complicated than any other linear operator. 



The second half of the paper, starting with Section VII 



culminates in the generalisation of Gaussian random variables 
to compact Lie groups and the re-derivation of the formulae 
in |7 1 for estimating the parameters of these random variables. 
Particular attention is given to the special orthogonal groups 
consisting of orthogonal matrices having unit determinant, 
otherwise known as the rotation groups. Symmetry makes Lie 
groups particularly nice to work with. 

Estimation theory on manifolds is touched on in Section XI 
the message being that an understanding of how an estimator 
will be used is needed to avoid making ad hoc choices about 
how to assess the performance of an estimator, including what 
it means for an estimator to be unbiased. 

The reason for introducing continuous-time processes rather 
than ostensibly simpler discrete-time processes is that the 
only linear structure on manifolds is at the infinitesimal scale 
of tangent spaces, allowing the theory of continuous-time 
processes to carry over naturally to manifolds. A strategy 
for working with discrete-time processes is treating them as 
sampled versions of continuous-time processes. In the same 
vein. Section |IX] uses Brownian motion to generalise Gaussian 
random variables to Lie groups. 

There are numerous omissions from this primer. Even the 
Ito formula is not written down! Perhaps the most regrettable 
is not having the opportunity to explain why stochastic differ- 
entials are genuine (second-order) differentials. 

An endeavour to balance fluency and rigour has led to 
the label Technicality being given to paragraphs that may 
be skipped by readers favouring fluency. Similarly, regularity 
conditions necessary to make statements true are routinely 
omitted. Caveat lector. 

II. Simulating Brownian motion 
A. Background 

A continuous-time process in its most basic form is just 
an infinite collection of real -valued random variables X{t) 
indexed by time t € [0, oo). Specifying directly a joint distri- 
bution for an infinite number of random variables is generally 
not possible. Instead, the following two-stage approach is 
usually adopted for describing the statistical properties of a 
continuous-time process. 
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First and foremost, all the finite-dimensional joint distri- 
butions of X{t) are given; they determine most, but not all, 
statistical properties of X{t). In detail, the finite-dimensional 
joint distributions are the distributions of X{t) for each r, 
the pairwise joint distributions of X{ti) and X{t2) for all 
Ti T2, and in general the joint distributions of X{ti) to 
X{Tn) for a finite but arbitrary n. 

For fixed r, it is emphasised that X{t) is simply a random 
variable and should be treated as such; that X{t) is a process 
is only relevant when looking at integrals or other limits 
involving an infinite number of points. However, the finite- 
dimensional distributions on their own are inadequate for spec- 
ifying the distributions of such limits. To exemplify, choose 
each X{t) to be an independent Gaussian random variable 
with zero mean and unit variance, denoted X{t) ^ iV(0, 1). 
Although formally a process, there is no relationship between 
any property of the index set [0, oo) and any statistical property 
of the random variables X{t), t £ [0, oo). For this process, 
limj_j.T- X{t) and X{t) dt do not even exist ijs p.45]. 

Markov processes are examples of a relationship existing 
between properties of the index set and statistical properties 
of the random variables; a process is Markov if the distribution 
of any future point X{t + h), h > 0, given past history 
{X{ti),--- ,X{Tn) \ Ti < ■ ■ ■ < Tn = t}. Only depends 
on X{t). This memoryless property of Markov processes 
relates the ordering of the index set [0,oo) to conditional 
independence of the random variables. 

Other examples are processes with continuous sample paths, 
where the topology of the index set relates to convergence of 
random variables. In detail, if X is a random variable then 
it is customary to denote an outcome of X by x. Similarly, 
let x{t) denote the realisation of a process X{t). When x{t) 
is considered as the function t i— >^ x{t), it is called a sample 
path. If (almost) all reahsations x{t) of a process X{t) have 
continuous sample paths, meaning t i-^ x{t) is continuous, 
then the process itself is called continuous. 

The second step for defining a continuous-time process is 
describing additional properties of the sample paths. A typical 
example is declaring that all sample paths are continuous. 
Although the finite-dimensional distributions do not define a 
process uniquely — so-called modifications are possible — the 
additional requirement that the sample paths are continuous 
ensures uniqueness. (Existence is a different matter; not all 
finite-dimensional distributions are compatible with requiring 
continuity of sample paths.) 

Technicality: For processes whose sample paths are continu- 
ous, the underlying probability space |9| can be taken to be 
(C([0, oo)), S', 7^) where C([0,oo)) is the vector space of all 
real-valued continuous functions on the interval [0, oo) and ^ 
is the (T-algebra generated by all cylinder sets. The probability 
measure V is uniquely determined by the finite-dimensional 
distributions of the process. 

If all finite-dimensional distributions are Gaussian then the 
process itself is called a Gaussian process. Linear systems 
preserve Gaussianity; the output of a linear system driven by 
a Gaussian process is itself a Gaussian process. This leads 
to an elegant and powerful theory of Gaussian processes in 
Hnear systems theory, and is the theory often found in signal 
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Fig. 1. Sample paths of Brownian motion on tlie closed interval [0, 1] 
generated as described in the text using St = 10~*. 



processing textbooks. Since manifolds are inherently nonlin- 
ear, such a simplifying theory does not exist for processes 
on manifolds. (Brownian motion can be defined on manifolds 
without reference to Gaussian random variables. Gaussian 
random fields can be defined on manifolds, but these are real- 
valued processes indexed by a manifold-valued parameter, as 
opposed to the manifold-valued processes indexed by time that 
are the protagonists of this primer) 

The archetypal continuous Markov process is Brownian 
motion, normally defined via its finite-dimensional distribu- 
tions and continuity of its sample paths pO| Section 2.2]. In 
the spirit of this primer though (and that of fTT\), processes 
are best understood in the first instance by knowing how to 
simulate them. The sample paths of Brownian motion plotted 
in Figure [T| were generated as follows. Set X{0) = and fix 
a step size St > 0. Note 6t is not the product of S and t but 
merely the name of a positive real-valued quantity indicating 
a suitably small discretisation of time. Let W{0),W{1), ■ ■ ■ 
be independent A^(0, 1) Gaussian random variables. Recur- 
sively define X{{k + 1) St) = X{k5t) + V5iW{k) for 
/c = 0, 1, 2, • • • . At non-integral values, define X{t) by linear 
interpolation of its neighbouring points: X{{k + a)5t) = 
{l-a)X{k5t)+aX{{k + l) St) for a g (0, 1). The generated 
process has the correct distribution at integral sample points 
t = 0, 6t, 2 6t, 3St, - ■ ■ and overall is an approximation of 
Brownian motion converging (in distribution) to Brownian 
motion as St — 0. 

Technicalities: Brownian motion is the unique continuous 
process that has stationary and independent increments; these 
properties force the process to be Gaussian |j_2J, a conse- 
quence of the central limit theorem. Had the W{k) been 
replaced by any other distribution with zero mean and unit 
variance, the resulting process X{t) still would have converged 
to Brownian motion as 6t — > 0. Alternative methods for gener- 
ating Brownian motion on the interval [0, 1] include truncating 
the Karhunen-Loeve expansion, and successive refinements to 
the grid: X{0) = 0, X{1) ^ 7V(0, 1), and given neighbouring 
points X{t) and X{t + St), a mid-point is added by the 
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rule X{t + f ) - ^(*)+^(*+^*) ^ v^7v(0, 1), thus allowing 

to be computed with 6t = 1, then and 
with (5t = i and so forth. Books specifically on Brownian 
motion include p2)-||T4). The origins of the mathematical 
concept of Brownian motion trace back to three independent 
sources; Thiele (1880), BacheUer (1900) and Einstein (1905). 
According to |15|, "Of these three models, those of Thiele 
and Bachelier had little impact for a long time, while that of 
Einstein was immediately influential". 

B. Brownian Motion and Stochastic Development 

Nothing is lost for the moment by treating manifolds as 
"curved surfaces such as the circle or sphere". 

Brownian motion models a particle bombarded randomly 
by much smaller molecules. The recursion X( {k + 1) 5t) = 

is thus 
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X{k5t) + \/5tW{k) introduced in Section 
(loosely) interpreted as a particle being bombarded at regular 
time instants. Between bombardments, there is no force acting 
on the particle, hence the particle's trajectory must be a curve 
of zero acceleration. In Euclidean space, this implies particles 
move in straight lines between bombardments, and explains 
why Unear interpolation was used earlier to connect X{kdt) 
to X{{k+1) St). On a manifold, a curve with zero acceleration 
is called a geodesic. Between bombardments, a particle on a 
manifold travels along a geodesic. 

Conceptually then, a piecewise approximation to Brownian 
motion on a manifold can be generated essentially as before, 
just with straight-line motion replaced by geodesic motion. 

Since the Earth is approximately a sphere, long-distance 
travel gives an intuitive understanding of the concepts of 
distance, velocity and acceleration of a particle moving on the 
surface of a sphere. Travelling "in a straight line" on the Earth 
actually means travelling along a great circle; great circles are 
the geodesies of the sphere. 

There are different ways of understanding geodesies, but the 
most relevant for subsequent developments is the following: 
rolling a sphere, without slipping, over a straight line drawn 
in wet ink on a flat table will impart a curve on the sphere 
that precisely traces out a geodesic. 

Rolling a manifold over a piecewise smooth curve in 
Euclidean space to obtain a curve on the manifold is called 
development. The development of a piecewise linear curve is 
a piecewise geodesic curve. If x{t) = [xi{t) , X2{t)) is the 
piecewise linear approximation in Figure [2] of a realisation of 
two-dimensional Brownian motion then the developed piece- 
wise geodesic curve obtained by rolling the sphere over the 
curve in Figure [2] is a piecewise geodesic approximation of a 
realisation of Brownian motion on the sphere. 

Development is defined mathematically as the solution of 
a certain differential equation. If a curve is not differentiable 
then it cannot be developed in the classical sense. A sample 
path of Brownian motion is nowhere differentiable! Therefore, 
in the first instance, development can only be used to take 
a piecewise smooth approximation of Brownian motion in 
Euclidean space and obtain a piecewise smooth approximation 
of Brownian motion on a manifold. Nevertheless, it is possible 
to "take limits" and develop a theory of stochastic devel- 
opment. The stochastic development of Brownian motion in 




0.05 
0.00 
-0.05 
-0.10 
-0.15 
-0.20 
-0.25 
-0.30 
-0.35 
-0.40 



Fig. 2. Path traced out by a veiy short segment of a single realisation of a 
two-dimensional Brownian motion starting from the diamond at the origin and 
stopping at the circle; X{t) = {X\(t), X2{t)) is two-dimensional Brownian 
motion if and only if Xi it) and X2 (t) are one-dimensional Brownian motions 
independent of each other. Time has been omitted. 



Euclidean space is the limiting process obtained by developing 
successively more accurate piecewise smooth approximations 
of the Brownian motion in Euchdean space. 
Technicality: This "smooth approximation" approach to 
stochastic development is more tedious to make rigorous 
than the stochastic differential equation approach taken in ||6] 
Chapter 2] but offers more intuition for the neophyte. 

The inverse of development is anti-development and can 
be visualised as drawing a curve on a manifold in wet 
ink and rolling the manifold along this curve on a table 
thereby leaving behind a curve on the table. Its stochastic 
counterpart, stochastic anti-development, can be thought of as 
the limiting behaviour of the anti-development of piecewise 
geodesic approximations of sample paths. 

As a rule of thumb, techniques (such as filtering |16|) for 
processes on Euclidean space can be extended to processes on 
manifolds by using stochastic anti-development to convert the 
processes from manifold-valued to Euclidean-space-valued. 
Although this may not always be computationally attractive, 
it nevertheless affords a relatively simple viewpoint. 

A process on a manifold is Brownian motion if and only 
if its stochastic anti-development is Brownian motion in Eu- 
clidean space |2, (8.26)]. To put this in perspective, note 
Brownian motion on a manifold cannot be defined using finite- 
dimensional distributions because there is no direct defini- 
tion of a Gaussian random variable on a manifold. (Even 
disregarding the global topology of a manifold, a random 
variable which is Gaussian with respect to one chart need 
not be Gaussian with respect to another chart.) Much of 
the Euclidean-space theory relies on linearity, and the only 
linearity on manifolds is at the infinitesimal scale of tangent 
spaces. Stochastic development operates at the infinitesimal 
scale, replicating as faithfully as possible on a manifold a 
process in Euclidean space. 

Although Gaussian random variables cannot be used to 
define Brownian motion on a manifold, the reverse is possible; 
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Brownian motion can be used to generalise the definition of a 
Gaussian random variable to a manifold; see Section |IX] 
Technicalities: There are a number of characterisations of 
Brownian motion that can be used to generalise Brownian 
motion to manifolds, including as the unique Ito diffusion gen- 
erated by the Laplace operator. The end result is nevertheless 
the same ||2|, ||3), ||5|, ||6), |[17)-|[19). Whereas for determin- 
istic signal processing, and in particular for optimisation on 
manifolds, there is benefit in not endowing a manifold with 
a Riemannian metric | |20| , | [2T| , Brownian motion must be 
defined with respect to a Riemannian metric. Although some 
concepts, such as that of a semimartingale, can be defined 
on a non-Riemannian manifold, it is simplest here to assume 
throughout that all manifolds are Riemannian manifolds. 

C. The Geometry of the Sphere 

This section continues the parallel threads of stochastic 
development and Brownian motion. Equations are derived 
for simulating Brownian motion on a sphere by rolling a 
sphere along a simulated path of Brownian motion in M^. It 
is convenient to change reference frames and roll a sheet of 
paper around a sphere than roll a sphere on a sheet of paper. 
Exercise: Mentally or otherwise, take a sheet of graph paper 
and a soccer ball. Mark a point roughly in the middle of the 
graph paper as being the origin, and draw two unit-length 
vectors at right-angles to each other based at the origin. Place 
the origin of the graph paper on top of the ball. Roll the 
paper down the ball until arriving at the equator Then roll the 
paper along the equator for some distance, then roll it back 
up to the top of the ball. Compare the orientation of the two 
vectors at the start and at the end of this exercise; in general, 
the orientation will have changed due to the curvature of the 
sphere. (Furthermore, in general, the origin of the graph paper 
will no longer be touching the top of the sphere.) 

Throughout, any norm || • || or inner product (•,•) on M" 
is the Euclidean norm or inner product. Perpendicular vectors 
are denoted v J- p, meaning {v,p) = 0. 

Take the sphere = {(a;i, 2:2, 2:3) e \ x\ x\ + 
x\ — 1}. Start the Brownian motion at the North pole: 
B(0) = (0,0,1). Place a sheet of graph paper on top of 
the sphere, so the origin (0, 0) of the graph paper makes 
contact with the North pole (0,0,1). Let Wi(0) and ^2(0) 
be realisations of independent iV(0, 1) random variables. On 
the graph paper, draw a line segment from the origin ( 0, 0) 
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to the point (V(5t?«i(0), v^i u;2(0)); recall from Section 
and Figure |2] that this is the first segment of an approximate 
sample path of Brownian motion on M^. The paper is sitting 
inside E'^, and the point {\fEtwxifS), ^2(0)) on the paper 
is actually located at the point (A/(5twi(0), ^2(0), 1) in 
W' because the paper is lying flat on top of the sphere. 

Rolling the paper down the sphere at constant velocity along 
the line segment so it reaches the end of the segment at time 
8t results in the point of contact between the paper and the 
sphere being given by 



for t e [0, Si\, where p = (0, 0, 1) is the original point of 
contact (the North pole) and v = {VSiwi{0),VStw2{0),0). 
This can be derived from first principles by requiring B{t) to 
remain in a plane and have constant angular velocity. 

In general, given any p G and v _L p, let Expp(u) denote 
the final contact point of the sphere and piece of paper obtained 
by starting with the paper touching the sphere at p, marking 
on the paper a line segment from p to p + v, and rolling the 
paper over the sphere along that line. (Since w _L p, and the 
paper is tangent to the sphere at p, the point p + v will lie on 
the paper) The curve 1 Expp(iu) is a geodesic and follows 
a great circle. Explicitly, Expp(O) = p and, for ||u|| ^ 0, 



Expp('i;) — cos (||i;||)p + sin (||w||) 



(2) 



B{t) = cos 



p + sm 



(1) 



At the risk of belabouring the point, if 5^ represents the Earth 
and a person sets out from a point p with initial velocity v 
and continues "with constant velocity in a straight line" then 
his position at time t will be Expp(to). In detail, each step 
actually involves first taking a perfectly straight step in the 
tangent plane, meaning the leading foot will be slightly off 
the Earth, then without swivelling on the other foot, letting 
the force of gravity pull the leading foot down to the closest 
point on Earth. This is a discrete version of "rolling without 
slipping" and hence produces (or defines) a geodesic in the 
limit as smaller and smaller steps are taken. 

The Riemannian exponential map Expp(u) can be defined 
analogously on any Riemannian manifold. The set of allowable 
velocities v for which Expp(w) makes sense is called the 
tangent space to the manifold at the point p; just like for the 
sphere, the tangent space can be visualised as a sheet of paper 
providing the best linear approximation to the shape of the 
manifold in a neighbourhood of the point of contact and v 
must be such that p + v lies on this (infinitely large) sheet 
of paper. Alternatively, if a sufficiently small (or sufficiently 
short-sighted) ant were standing on the manifold at point p, so 
that the manifold looked flat, then the set of possible directions 
(with arbitrary magnitudes) the ant could set out in from his 
perspective forms the tangent space at p. 
Technicalities: The Riemannian exponential function is de- 
fined via a differential equation. If the manifold is not com- 
plete, the differential equation may "blow up"; this technicality 
is ignored throughout the primer. Since every Riemannian 
manifold can be embedded in a sufficiently high-dimensional 
Euclidean space, this primer assumes for simplicity that all 
Riemannian manifolds are subsets of Euclidean space. The 
Riemannian geometry of such a manifold is determined by 
the Euclidean geometry of the ambient space; the Euclidean 
inner product induces an inner product on each tangent space. 
This is consistent with defining the length of a curve on a 
manifold as the Euclidean length of that curve when viewed 
as a curve in the ambient Euclidean space. 

Returning to simulating Brownian motion on the sphere, 
recall the original strategy was to generate a Brownian motion 
on the plane then develop it onto the sphere. Carrying this out 
exactly would involve keeping track of the orientation of the 
paper as it moved over the sphere. Although this is easily done, 
it is not necessary for simulating Brownian motion because 
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Gaussian random vectors are symmetric and hence invariant 
with respect to changes in orientation. 

If a particle undergoing Brownian motion is currently at 
the point p E S^, its next position, after 6t units of time, 
can be simulated by generating a three-dimensional Gaussian 
random vector v ^ N{0,I) E M.^, projecting v onto TpS^ — 
replace v hy v — {v,p)p — and declaring the next position of 
the particle to be 'Expp{y/dt v) . This generalises immediately 



The first requirement placed on M is that it infinitesimally 



look like 



Precisely, it is required that TpM C M" is an 



to arbitrary manifolds and is summarised in Section II-E 



(Alternatively, given an orthonormal basis for TpS , a two- 
dimensional Gaussian random vector could have been used to 
generate an appropriate random element of TpS^.) 

Although orientation was ultimately not needed for defining 
Brownian motion, it is an important concept by which to 
understand curvature, and enters the picture for more general 
processes (such as Brownian motion with drift). 
Technicality: For a non-embedded manifold M, the natu- 
ral setting for (stochastic) development is the frame bundle 
J"(M) of M equipped with a connection (3), ||5|l, ||^. The 
connection decomposes J-(M) into horizontal and vertical 
components, and leads to the concept of a horizontal process. 
A horizontal process on T{M) is essentially a process on 
M augmented by its current orientation. If M is Riemannian 
then the orthonormal frame bundle 0{M) can be used instead 
of T{M). Horizontal Brownian motion can be defined on 
0{M) via a Stratonovich stochastic differential equation that 
stochastically develops Brownian motion in Euclidean space 
onto the horizontal component (with respect to the Levi-Civita 
connection) of 0{M). The bundle projection of this horizontal 
Brownian motion yields Brownian motion on M. 

D. A Working Definition of a Riemannian Manifold 

For the purposes of this primer, manifolds are defined as 
subsets of Euclidean space that are sufficiently nice to permit 
a useful theory of differentiation of functions from one such 
subset to another. (Furthermore, only C°° -smooth manifolds 
are discussed.) Conditions will be given for a subset M C M" 
to be an m-dimensional manifold for some positive integer 
m < n. (A zero-dimensional manifold is a countable col- 
lection of isolated points and will not be considered further.) 
This will confirm the circle and sphere S*^ as manifolds 
of dimension one and two, respectively. Graphs of smooth 
functions are prototypical manifolds: if /: M'" M"^'" is 
a smooth function, meaning derivatives of all orders exist, its 
graph M = {{x, f{x)) e M™ x M"-™ M" | x e M"} is an 
TO-dimensional manifold. 

For each point p E M, define TpM as the set of all possible 
velocity vectors 7'(0) taken on by smooth curves 7: M ^ 
M" whose images are wholly contained in M and that pass 
through p at time 0. For example, if p E M = then the 
(only) requirements on 7 are that it is smooth, that 7(0) = p 
and ||7(i)|| = 1. In symbols, 

TpM = {7'(0) I 7 : M ^ M", 7(0) - p, 1^ C M} (3) 



where it is implicitly understood that 7 must be infinitely 
differentiable. (No difference results if 7 is only defined on 
an open neighbourhood of the origin; usually such curves are 
denoted 7: (— e, e) M in the literature.) 



m-dimensional vector subspace of M" for every p E M. This 
prevents M from having (non-tangential) self-intersections, 
e.g., the letter X is not a manifold, and it prevents M from 
having cusps, e.g., the letter V is not a manifold because no 
smooth curve 7 passes through the bottom tip of V except for 
the constant curve with 7'(0) — 0. 

Usually this first requirement is not stated because it is 
subsumed by requiring the manifold be locally Euclidean, 
defined presently. Nevertheless, it emphasises the importance 
of tangent spaces. The visual image of a piece of paper placed 
against a sphere at the point p G 5^ C is the affine tangent 
space. The tangent space TpS^ is obtained by taking the piece 
of paper and translating it to pass through the origin of the 
ambient space M?. This distinction is usually blurred. 

The first requirement fails to disqualify the figure of eight 
from being a manifold because it cannot detect tangential self- 
intersections. This can only be detected by considering what 
is happening in a neighbourhood of each and every point; it 
is required that for all p E M C M" there exists a sufficiently 
small open ball Br{p) — {z E M" | \\z — < r} of radius 
r > and a diffeomorphism h: M" — Br{p), meaning h is 
bijective and both h and its inverse are smooth, such that 



h{«^ X {Q}) = Br{p) r\ M 



(4) 



where M™ x {0} = {{x, 0) eW\xE M"} is the embedding 
of M™ into M" obtained by setting the last n — m coordinates 
equal to zero. (The basic intuition is that the classical calculus 
on a flat subspace M™ x {0} of M" should be extendable to a 
calculus on diffeomorphic images of this flat subspace.) 

The restriction of ft, in Q to M™ x {0} is a parametrisation 
of a part of the manifold M, however, the condition is stronger 
than this since it requires the parametrisation include all points 
of M in Br{p) n M and no other points. This excludes the 
figure "8" because at the point where the top and bottom 
circles meet, every one-dimensional parametrisation can get 
at best only part of the lower hemisphere of the top circle 
and the upper hemisphere of the bottom circle. In fact, Q 
ensures that every manifold locally looks like a rotated graph 
of a smooth function. 

A manifold M C M" inherits a Riemannian structure from 
the Euclidean inner product on M". This leads to defining 
the acceleration of a curve 7: M — > A/ C M" at time t as 
Hy[t) where iTp-. M" — > TpM is orthogonal projection 

onto TpM. The curve 7 is a geodesic if and only if 'y"{t) 
contains only those vectorial components necessary to keep 
the curve on the manifold, that is, 7r,^(t) {'j"{t)) = for all t. 
This same condition could have been deduced by developing 



a straight line onto M, as in Section II-B 
Technicalities: On a non-Riemannian manifold, there is a 
priori no way of defining the acceleration of a curve because 
there is no distinguished way of aligning TpM and TqM for 
two distinct points p and q. The above definition implicitly 
uses the Riemannian structure coming from the ambient space 
M" and accords with comparing tangent vectors at two distinct 
points of a curve by placing a piece of paper over the manifold 
at point q, drawing the tangent vector at q on the piece of 
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paper, then rolling the paper to p along the curve, and drawing 
the tangent vector at p on the paper. Because the piece of 
paper represents Euclidean space, the base points of the vectors 
drawn on the paper can be translated in the usual way so that 
they align. This then allows the difference of the two vectors to 
be taken, and ultimately, allows the acceleration to be defined 
as the rate of change of the velocity vectors. A mechanism 
for aligning neighbouring affine tangent spaces along a curve 
is called an affine connection. The particular affine connec- 
tion described here is the Levi-Civita connection, the unique 
torsion-free connection that is compatible with the Riemannian 
metric. In more advanced settings, there may be advantages 
to using other affine connections. A limitation of insisting 
manifolds are subsets of Euclidean space is that changing to a 
different metric requires changing the embedding, for example, 
changing the sphere into an ellipse. 

E. Brownian Motion on Manifolds 

Assembling the pieces leads to the following algorithm for 
simulating Brownian motion on an m-dimensional Rieman- 
nian manifold M C W\ 

Choose a starting point on M; set -B(O) to this point. 
Fix a step size 5t > 0. For k = 0, 1, - -, recursively do 
the following. Generate a Gaussian random vector W{k) e 



'-B{k&t) 



M C M", either with the help of an orthonormal basis 



for TB(kSt)M, or by generating an n-dimensional N{0,I) 
Gaussian random vector and projecting the vector orthogonally 
onto TB(^kSt)M to obtain W{k). Then define 



B{kSt + t)= Exps 



(kSt) 



St 



StW{k) 



(5) 



for t e [0,St]. If M = M" then Exp^) = p + v and ^ 



agrees with the algorithm in Section II-A 



Technicality: An advantage of projecting orthogonally onto 
the tangent space rather than constructing an orthonormal 
basis is that while the orthogonal projection iTp : E" TpM 
varies smoothly in p, on non-parallelisable manifolds it is not 
possible to find a continuous mapping from p G A/ to an 
orthonormal basis of TpM. The hairy ball theorem implies 
that the sphere is not parallelisable. (In fact, in terms of 
spheres, only 5", S^, and S"^ are parallelisable.) 

It is verified in p2| that, as St — > 0, the above approxi- 
mation converges in distribution to Brownian motion, where 
Brownian motion is defined by some other means. (For the 
special case of Lie groups, see also |23J. For hypersurfaces, 
see p4).) Nevertheless, engineers (and physicists) may find it 
attractive to treat the limit of (|5]) as the definition of Brownian 
motion. All (|5]l is saying is that at each step, movement in 
any direction is equally likely and independent of the past. 
By the central limit theorem, it suffices for the W{k) to 
have zero mean and unit variance; see p5) for an analysis 
of an algorithm commonly used in practice. The presence of 
the square root in the term a/^ is easily explained by the 
compatibility requirement that the variance of B{T) generated 
using a step size of St be equal to the variance of B{T) 
generated using a step size of ^; if this were not so then 
the processes need not converge as St 0. 



While (|5]l is suitable for numerical work, for calculations by 
hand it is convenient to "take limits" and work with the actual 
process. A direct analogy is p referring to work with ^ rather 



than 



x{t+St)-x{t) 
It 



Section 



IV 



introduces a stochastic calculus. 



III. State-Space Models on Manifolds 



A. Motivation 

Signal processing involves generating new processes from 
old. In Euclidean space, a process can be passed through a 
linear time-invariant system to obtain a new process. This 
can be written in terms of an integral and motivates asking 
if a continuous-time process evolving on a manifold, such as 
Brownian motion, can be integrated to obtain a new process. 

Another obvious question is how to generalise to manifolds 
state-space models with additive noise. The classical linear 
discrete-time state-space model is 



Yk = CkXk + DkWk 



(6) 
(7) 



where Ak, Bk,Ck, Dk are matrices and Vk,Wk are random 
vectors (noise). The vector Xk is the state at time fc, and the 
state-space equation represents the dynamics governing how 
the state changes over time. It comprises a deterministic part 
AkXk and a stochastic part B^Vk. The second equation is 
the observation equation: the only measurement of the state 
available at time k is the vector Yk comprising a linear function 
of the state and additive noise. 

There does not appear to be a natural generalisation of 
discrete-time state-space models to arbitrary manifolds be- 
cause it is not clear how to handle the addition of the two 
terms in each equation. (In some cases, group actions could be 
used.) It will be seen presently that continuous-time state-space 
models generalise more easily. This suggests the expediency 
of treating discrete-time processes as sampled versions of 
continuous-time processes. 

The continuous-time version of (|6| would be 



(8) 



dX ,,,,,,, r., .dV(t) 
-^Ait)Xit)^Bit)-n 

if the noise process V{t) was sufficiently nice that its sample 
paths were absolutely continuous; that this generally is not the 
case is ignored for the moment. 

Although Y{t) = C{t)X{t) + D{t)W{t) is an analogue of 
(j7]l, usually the observation process takes instead the form 



dY_ 
~dt 



C{t)X{t)+D{t) 



dW{t) 
dt 



(9) 



which integrates rather than instantaneously samples the state. 

Although the right-hand sides of (|6]l and ([8]) are sums of two 
terms, crucially, it is two tangent vectors being summed in ([8]l. 
Two points on a manifold cannot be added but two tangent 
vectors in the same tangent space can. Therefore, ([8]l and (j9]l 
extend naturally to manifolds provided the terms A{t)X{t) 
and C{t)X{t) are generalised to be of the form b{t,X{t)); 
see ([Toji. The challenge is if V{t) and W{t) are Brownian 
motion then (|8]l and (|9]l require an appropriate interpretation 
because Brownian motion is nowhere differentiable (almost 
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surely). The following subsections hint at how this is done 



via piecewise approximations, and Section IV gives a rigorous 
interpretation by changing (j8]l and (j9]) to integral form. 

B. Modelling the State Process 

Building on the material in Sections [ll| and [lirA] an attempt 
is made to model a particle moving on a sphere subject to 
disturbance by Brownian motion. Let X{t) G S*^ C M"^ 
denote the position of the particle at time t. Its deterministic 
component can be specified by a differential equation 

dX 



dt 



= b{t,X{t)). 



(10) 



Provided b{t,X{t)) lies in the tangent space of the sphere at 
X{t), the solution of ( [To| ) is forced to lie on the sphere if 
the initial point X{Q) does. A simple numerical solution of 



( 10 1 is obtained by combining the forward-Euler method with 
the Riemannian exponential function, the latter ensuring the 
approximate solution remains on the sphere: 

Xit + St) = Exp^(,) {St b{t, X{t))) . (1 1) 

Referring to (j5]l with t = St, the following idea presents itself: 

X{t + St) ^ Expx(t) {St b{t, X{t)) + y/Si W{t)) (12) 

where W{t) is an iV(0,/) Gaussian random vector in 
projected orthogonally onto the tangent space Tx(t)S'^. The 
(approximately instantaneous) velocity of the particle at time 
t is the sum of a deterministic component and a random 
component. Continuous- time approximations can be obtained 
by interpolating using geodesies, as in (|5]l, in which case ( 12 1 
converges to a well-defined process on the sphere |26l. 



C. Modelling the Observation Process 

Notwithstanding that the first two cases are subsumed by 
the third, the three cases of interest are: the state process 
evolves in Euclidean space yet the observation process is 
manifold-valued; the state process evolves on a manifold but 
the observation process is real-valued; and, the state and 
observation processes are manifold-valued. 

If the state process X{t) evolves in Euclidean space then 
stochastic development can be used to feed it into the obser- 
vation process p6) : 

Y{t + St) = Expy(j) {X{t + St)- X{t) + ^/TtW{t)). (13) 

If X{t) is not observed directly, but only g{X{t)) where 
g is a smooth function between Euclidean spaces, then a 
straightforward modification of ( [T3| ) is 

Y{t + St) = Exp^(,) {g{Xit + St))^ g{X{t)) + VSt W{t)) . 

(14) 

In other words, first the new process X{t) = g{X{t)) is 
formed, then noise is added to it, and finally it is stochastically 
developed (via piecewise approximations) onto the manifold. 

If the state process X{t) evolves on a manifold of dimen- 
sion m but the observation process Y{t) is real-valued then 
stochastic anti-development can be used p7) . In a sufficiently 
small domain, Exp is invertible, and the instantaneous velocity 



of X{t), which in general does not exist, can nevertheless be 
approximated by Exp^^^^ {^{t + St) ^ ^(t))- This produces 
a vector in M™ which can be used to update an observation 
process evolving in Euclidean space. (By interpreting differen- 



tial equations as integral equations, as discussed in Section IV 



neither X{t) nor Y{t) need be differentiable for there to be 
a well-defined limiting relationship between the instantaneous 
velocities of piecewise approximations of the processes.) 

Finally, the general case of both X{t) and Y{t) evolving 
on manifolds falls under the framework of stochastic differ- 
ential equations between manifolds f28', Section 3]. Basically, 
F,xpx\t) '^^'1 used to obtain a real-valued vector approx- 
imating the instantaneous velocity of X{t), which after a 
possible transformation, can be fed into ExpY{t) to update 
the observation process Y{t). See ||2J for details. 

D. Discussion 

Section Hn] demonstrated, at least on an intuitive level, that 
stochastic development and anti-development, and the Exp 
map in particular, provide a relatively straightforward way of 
generalising state-space models to manifolds. 

Nevertheless, it is important to understand what the limiting 
processes are that Exp is being used to approximate. This is 
the purpose of stochastic calculus, and once it is appreciated 
that the "smooth approximation" approach discussed in this 
primer leads to a stochastic calculus, it is generally easier to 
use directly the stochastic calculus. 

IV. Stochastic Calculus and Ito Diffusions 

This section does not consider marufolds or processes with 
jumps |29( . Standard references include pO) , pO) . 

A. Background 



A generalisation of (lOi is the functional equation 



X{t) = X(0) 



h{s,X{s))ds. 



(15) 



A solution of (15i is any function X{t) for which both sides 



of ( 15 I exist and are equal. Every solution of ( 10 1 is a solution 
of (15 I but the converse need not hold; whereas X{t) must 



be differentiable for the left-hand side of ( [TOj i to make sense, 
there is no explicit differentiability requirement in ( 15 i, only 
the requirement that b{s^X{s)) be integrable. 

The same idea carries over to random processes; although 
Brownian motion cannot be differentiated, it can be used as 
an integrator, and the state-space equations (|8]l and (j9]) can be 
written rigorously as functional equations by integrating both 
sides. However, there is in general no unique way of defining 
the integral of a process. 

Since it may come as a surprise that different definitions 
of integrals can give different answers, this phenomena will 
be illustrated in the deterministic setting by considering how 
to integrate Holder continuous functions of order i. Such 
functions are continuous but may oscillate somewhat wildly. 
Technicality: Although sample paths of Brownian motion are 
almost surely not Holder continuous of order i, they are 



g 



almost surely Holder continuous of any order less than ^. 
This is irrelevant here because, as explained below, the relevant 
fact about Brownian motion is that E [\B{t + St) — B{t)\] is 
proportional to \^ rather than St. The appearance of the 
expectation operator characterises the stochastic approach to 
integration. Interestingly, a complementary theory known as 
rough paths has been developed recently |[3T)-p4), based 
partially on observations made in f35l, f36'| and pTJ. 

Recall the Riemann-Stieltjes integral j]^ f{s) dg{s) which, 
for smooth / and g, is a limit of Riemann sums: 



N-l 



iV — yoo ' ^ 



(16) 



The right-hand side of ( [T6| gives the same answer if the right, 
not left, endpoint is used for each interval [-^, ^^]. Indeed, 
the difference between using left or right endpoints is 



N-l 

E 

k=0 



/(#)) 



9m 



(17) 



If / is smooth then |/(^) - /(A; 



< a 



fN 



for some 



constant af £ . 

^N-l 1 



and analogously for g. Therefore, |eAr| < 
and converges to zero as N ^ oo. 



k=0 N N 

If now / and g are not differentiable, but merely Holder 

continuous of order i, then |/(^) - /(;|')| < a/|;^|2 for 
some constant a/ G M, and analogously for g, leading to 

lew I < afagY,k=o \jr\^ Ijjl^ which converges to a/ftg, 
and not zero, as iV — > oo. This means it is possible for two 
different values of the integral f{s) dg{s) to be obtained 
depending on whether f{j^) or /(^^) is used in (16 1. 
If at least one of / or g is smooth and the other is Holder 



continuous of order ^ then once again 



If g is replaced by real-valued Brownian motion B{t) 
and the above calculations carried out, a relevant quantity 
is the rate at which the expected value of \B{t + 5t) — 
B{t)\ decays to zero. Since B{t + St) - B{t) - N{Q,St), 
'E[\B{t + St) — B[t)\] is proportional to \/5i, analogous to 
\f{t + St) — f{t)\ cx ^/Si for Holder continuous functions 
of order i. Not surprisingly then, differences can appear for 
integrals of the form JX{t)dY{t) when X{t) and Y{t) 
are stochastic processes; other integrals, such as J X{t) dt 
and / h{t) dY{t), with h smooth, are unambiguous. (This 
presupposes X{t) and Y{t) are semimartingales p8).) 
Technicalities: Lebesgue-Stieltjes theory requires finite vari- 
ability, excluding Brownian motion as an integrator Holder 
continuous functions of order greater than ^ can be inte- 
grated with respect to Lipschitz functions using the Young 
integral without needing finite variability |39|. Brownian mo- 
tion falls just outside this condition. Stochastic integration 
theory depends crucially on integrators having nice statistical 
properties for Riemann-sum approximations to converge; see 
Section IVI-AI The sums are sensitive to second-order infor- 
mation (cf. Ito's formula |10|), hence "second-order calculus" 
is fundamental to stochastic geometry |2, Section VI]. 

B. ltd and Stratonovich Integrals: An Overview 

Non-equivalent definitions of stochastic integrals pO) , pT] 
all involve taking limits of approximations but differ in the 



approximations used and the processes that are allowed to be 
integrators and integrands. The two most common stochastic 
integrals are the Ito and Stratonovich integrals. 

The Ito integral leads to a rich probabilistic theory based 
on a class of processes known as semimartingales, and a 
resulting stochastic analysis that, in some ways, parallels 
functional analysis. A tenet of analysis is that properties 
of a function f{t) can be inferred from its derivative; for 
example, a bound on f(T) can be derived from bounds on 
f{t) because /(T) = f'{t) dt. (It is remai-kable how often 
it is easier to study an infinite number of linear problems, 
namely, examining f {t) for each and every t in the range 
to T.) Thinking of the random variable X{T) as an infinite 
sum of its infinitesimal differences — X{T) = dX{t) 
— suggests that by understanding the limiting behaviour of 
X{t + St) — X{t) as St 0, it is possible to infer properties 
of X{T) that may otherwise be difficult to infer directly. 
Technicality: If the decomposition Y(T) — dY{t) of 
Y{t) = f{X{t)) is sought, the Ito formula allows the limiting 
behaviour of Y{t + St) — Y{t) to be determined directly from 
the limiting behaviour of X{t + St) — X{t). 

The Ito integral does not respect geometry ||2i|; it does 
not transform "correctly" to allow a coordinate-independent 
definition. Nor does the Ito integral respect polygonal ap- 
proximations: if Y'^{t) is a sequence of piecewise linear 
approximations converging to Y{t) then it is not necessarily 
true that / X{t)dY''{t) J X{t)dY{t). 

The Stratonovich integral lacks features of the Ito integral 
but respects geometry and polygonal approximations, making 
it suitable for stochastic geometry and modelling physical 
phenomena. Fortunately, it is usually possible to convert from 
one to the other by adding a correction term, affording freedom 
to choose the simpler for the calculation at hand. 

By respecting geometry, the Stratonovich integral behaves 
like its deterministic counterpart; this is the transfer prin- 
ciple. The archetypal example is that the trajectory X{t) 
of a Stratonovich stochastic differential equation dX{t) = 
f{t,X{t)) o dB{t) stays on a manifold M if f{t,X{t)) lies 
in the tangent space Tx(t)M\ cf (10 1. 



In terms of modelling, the Ito integral is suitable for approx- 
imating inherently discrete-time systems by continuous-time 
systems (e.g., share trading), while the Stratonovich integral 
is suited to continuous-time physical processes because it 
describes the limit of piecewise smooth approximations f42l. 
Technicality: On a non-Riemannian manifold, a Stratonovich 
integral but not an Ito integral can be defined because only the 
former respects geometry. Only once a manifold is endowed 
with a connection can an Ito integral be defined. 

C. Semimartingales and Adapted Processes 

The class of processes called semimartingales emerged over 
time by attempts to push Ito's theory of integration to its 
natural limits. Originally defined as "adapted cadlag processes 
decomposable into the sum of a local martingale and a process 
of finite variation", the Bichteler-Dellacherie theorem ||43), 
| |44) states that semimartingales can be defined alternatively 
(in a simple and precise way Q) as the largest class of 
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"reasonable integrators" around which can be based a powerful 
and self-contained theory of Ito stochastic integration [38|. 
Technicality: By restricting the class of integrands, the class of 
integrators can be expanded beyond semimartingales, leading 
to an integration theory for fractional Brownian motion, for 
example. Nevertheless, the Ito theory remains the richest. 

From an engineering perspective, primary facts are: all Levy 
processes |45|, including the Poisson process and Brownian 
motion, are semimartingales, as are all (adapted) processes 
with continuously differentiable sample paths, and if a semi- 
martingale is passed into a system modelled by an Ito integral, 
the output will also be a semimartingale. From an analysis 
perspective, semimartingales are analogues of differentiable 
functions in that a process X{t) can be written, and studied. 



in terms of its differentials dX{t); see Section IV-B (Whereas 
the differential of a smooth function captures only first-order 
information, the Schwartz principle f^. (6.21)] is that dX{t) 
captures both first-order and second-order information. This 
is stochastic information though; sample paths of semimartin- 
gales need not be differentiable.) 

Crucial to Ito's development of his integral was the restric- 
tion to adapted processes: in J X dY, Ito required X{t) not 
to depend on Y{t) for t > t. (The borderline case t — t 
results in Ito's requirement that X be caglad and Y cadlag.) 
A filtration formalises what information has been revealed up 
to any given point in time. Adaptedness to the filtration at 
hand is a straightforward technical condition | lOJ taken for 
granted, and hence largely ignored, in this primer. 



for t € [0, 1] and non-negative integers k. (For notational 
convenience, S is being used here instead of St.) Then 



X{s)dY^{s) 



(23) 



is a well-defined ordinary integral, and Z^{t) Z{t) as 5 
0. By differentiating Y^{s), (23 1 becomes 



N-l Ak+l)S 
Y{(k+l)5)-Y(kS) I 

^ JkS 



k=0 



X{s)ds (24) 



where S = j^. 

rik+l)5 



This agrees in the limit with (21 
X{Tk) where Tk = (fc 



whenever 



I IkS ^(s) ^ ^(^fc) where = (fc + i)<5. 
Note: In the one-dimensional case, other reasonable approx- 
imations can be used. However, in general (i.e., when non- 
commuting vector fields are involved), failure to use piecewise 
linear approximations can lead to different answers |28|. 
Technicality: Unlike for the Ito integral, it is harder to pin 
down conditions for the Stratonovich integral to exist. This 
makes the definition of the Stratonovich integral a moving 
target. It is generally preferable to use (21 1 with X{Tk) 



replaced by ^{X{tk)+X{tk+i))', the modified sum apparently 
converges under milder conditions |47| . Alternatively, p8) 
declares the Stratonovich integral to exist if and only if the 
polygonal approximations with respect to all (not necessarily 
uniform) grids converge in probability to the same limit, the 
limit then being taken to be the value of the integral. For 
reasonable processes though, these definitions coincide. 



D. The ltd and Stratonovich Integrals 

Let X{t) be a continuous process and Y{t) a semimartin- 
gale (both adapted to the same filtration). The Ito integral 

Z{t) = [ X{s)dY{s) (18) 



can be interpreted as a system with transfer function X{t) that 
outputs the semimartingale Z{t) in response to the input Y{t). 
By 138] Theorem 11.21], ^ is the limit (in probability) 



N-l 



Z{t)^ lim y^X{tk){Y{tk+i)-Y{tk)) (19) 



fe=0 

where tk = jjt. (The Ito integral naturally extends to adapted 
caglad integrands. This suffices for stochastic differential equa- 
tions. With effort, further extensions are possible p8).) 
The Stratonovich integral |46| is denoted 

Z{t) = [ X{s)odY{s). (20) 
It can be thought of as the limit (in probability) 

JV-l 

Z{t)^ lim y2x{Tk){Y{tk+i)-Y{tk)) (21) 



k=0 



where — jjt and = *fc+*fc+i ^ Alternatively, it can be 
evaluated as a limit of ordinary integrals. Define the piecewise 
linear approximation 



Y\k5 + t) = {l-t)Y{k5)+tY[{k + l)5) 



(22) 



E. Stochastic Dijferential Equations 
The stochastic differential equation 

dX{t) = b{t, X{t)) dt + X{t)) dB{t) (25) 

is shorthand notation for the functional equation 

X{t)=X{0)+ f b{s,Xis))ds+ f j:{s,X{s))dB{s) 

(26) 

which asks for a semimartingale X{t) such that both sides of 



(26 1 are equal. In higher dimensions, the diffusion coefficient 
S is a function returning a matrix and the drift b a function 
returning a vector. 



The Ito equation ( 26 1 can be solved numerically by a stan- 
dard forward-Euler method (known in the stochastic setting 
as the Euler-Maruyama method) |49j-[|53J. This validates the 



otherwise ad hoc models developed in Section III 



Replacing the Ito integral by the Stratonovich integral 
results in a Stratonovich stochastic differential equation. Since 
the summation in (21 1 involves evaluating X{t) at a point 
in the future — the midpoint of the interval rather than 
the start of the interval — solving a Stratonovich equation 
necessitates either using an implicit integration scheme (such 
as a predictor-corrector method) or converting the Stratonovich 
equation into an Ito equation by adding a correction to the drift 

m 



coefficient |50| 



For example, in the one-dimensional 



case, solutions of the Stratonovich equation 

dX{t) = a{t,X{t))odB{t) 



(27) 



10 



correspond to solutions of the Ito equation 

r)rr 

dX{t) = \a{t,X{t)) —{t,X{t))dt + a{t,X{t))dB{t). 



(28) 



F. ltd Diffusions 

A process X{t) is an Ito diffusion if it can be written 



in the form (26i where B{t) is Brownian motion. (In fact. 



Ito developed his stochastic integral in order to write down 
directly continuous Markov processes based on their infinites- 
imal generators p5].) Stratonovich equations of the analogous 
form are also Ito diffusions. Ito diffusions are particularly nice 
to work with as they have a rich and well-established theory. 
Technicality: The class of continuous Markov processes should 
be amongst the simplest continuous-time processes to study 
since local behaviour presumably determines global behaviour 
Surprisingly then, even restricting attention to strongly Markov 
processes (thus ensuring the Markov property holds also at 
random stopping times) does not exclude complications, es- 



pecially in dimensions greater than one 1 19 Section IV.5]. The 
generic term "diffusion" is used with the aim of restricting at- 
tention to an amenable subclass of continuous strongly Markov 
processes. Often this results in diffusions being synonymous 
with Ito diffusions but sometimes diffusions are more general. 

V. From Euclidean Space To Manifolds 



Section III used piecewise geodesic approximations to de- 
fine state-space models on manifolds. These approximations 
converge to solutions of stochastic differential equations. 
Conversely, one way to understand Stratonovich differential 
equations on manifolds is as limits of ordinary differential 
equations on manifolds applied to piecewise geodesic approx- 
imations of sample paths |Q Theorem 7.24]. 

Stratonovich equations on a manifold M can be defined 
using only stochastic calculus in Euclidean space. A class 
of candidate solutions is needed: an Af -valued process Xt is 
a semimartingale if the "projected" process f{Xt) is a real- 
valued semimartingale for every smooth / : M — >■ M; a shift 
in focus henceforth makes the notation Xt preferable to X{t). 
A Stratonovich equation on M gives a rule for dXt in terms 
of Xt and a driving process. Whether a semimartingale Xt 
is a solution can be tested by comparing df{Xt) with what it 
should equal according to the stochastic chain rule; cf ( [35| ). 
If this test passes for every smooth f : M M. then Xt is 
deemed a solution ||6] Definition 1.2.3]. 

The following subsections touch on other ways of working 
with stochastic equations on manifolds. 

A. Local Coordinates 

Manifolds are formally defined via a collection of charts 
composing an atlas |56|. (Charts were not mentioned in 



Section II-D but exist by the implicit function theorem.) A 
direct analogy is the representation of the Earth's surface by 
charts (i.e., maps) in a cartographer's atlas. Working in local 
coordinates means choosing a chart — effectively, a projection 
of a portion of the manifold onto Euclidean space — and 



working with the projected image in Euclidean space using 
Cartesian coordinates. This is the most fundamental way of 
working with functions or processes on manifolds. 

A clear and concise example of working with stochastic 
equations in local coordinates is |57|. When written in local 
coordinates, stochastic differential equations on manifolds are 
simply stochastic differential equations in Euclidean space. 

Working in local coordinates requires ensuring consistency 
across charts. When drawing a flight path in a cartographer's 
atlas and nearing the edge of one page, changing to another 
page necessitates working with both pages at once, aligning 
them on their overlap. For theoretical work, this is usually only 
a minor inconvenience. The inconvenience may be greater for 
numerical work. 

B. Riemannian Exponential Map 

On a general manifold there is no distinguished choice of 
local coordinates; all compatible local coordinates are equally 
valid. On a Riemannian manifold though, the Riemannian 
exponential map is a distinguished choice of local coordinate 
system that has a number of attractive properties. It plays a 
central role in the smooth approximation approach to stochas- 
tic development, and has been used throughout this primer 
to generalise Brownian motion and stochastic differential 
equations to manifolds. 

The main disadvantage of the Riemannian exponential map 
is that, in general, it cannot be evaluated in closed form, 
and its numerical evaluation may be significantly slower than 
working instead with extrinsic coordinates or even other local 
coordinate systems. 

C. Extrinsic Coordinates 

Stochastic equations on manifolds embedded in Euclidean 
space can be written as stochastic equations in Euclidean space 
that just so happen to have solutions lying always on the 
manifold; see Section |VlI] 

Even though all manifolds are embeddable in Euclidean 
space, there is no reason to expect an arbitrary manifold to 
have an embedding in Euclidean space that is convenient to 
work with (or even possible to describe). The dimension of 
the embedding space may be considerably higher than the 
dimension of the manifold, making for inefficient numerical 
implementations. Numerical implementations may be prone 
to "falling off the manifold", or equivalently, incur increased 
computational complexity by projecting the solution back to 
the manifold at each step. (Falling off the manifold is only a 
numerical concern because the transfer principle makes it easy 
to constrain solutions of Stratonovich equations in M" to lie 
on a manifold M C M"; see Section [IV^ ) 
Technicality: Given the use of Grassmann manifolds in signal 
processing applications, it is remarked there is an embedding 
of Grassmann manifolds into matrix space given by using 
orthogonal projection matrices to represent subspaces. 

D. Intrinsic Operations 

Differential geometry is based on coordinate independence. 
On a Riemannian manifold, the instantaneous velocity and 
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acceleration of a curve have intrinsic meanings understand- 
able without reference to any particular coordinate system. 
Similarly, stochastic operators can be defined intrinsically |2|. 

With experience, working at the level of intrinsic operations 
is often the most convenient and genteel. It is generally the 
least suitable for numerical work. 

VI. A Closer Look at Integration 

In probability theory, the set of possible outcomes is denoted 
fi, and random variables are (measurable) functions from il 
to M. Once Tyche, Goddess of Chance, decides the outcome 
a; G ri, the values of all random variables lock into place |j9 |. 
Therefore, for clarity, stochastic processes will be written 
sometimes as Xt{u!). 

Several issues are faced when developing a theory for 
stochastic integrals 



Z{io) = / Xtiio)dYtiLj). 



(29) 



(This generalises to Zt{uj) — J^XtdYt giving a process Zt 
instead of a random variable Z.) That Xt must be predictable 
is already explained clearly in the literature; see the section 
"Naive Stochastic Integration Is Impossible" in p8) . Perhaps 
not so clearly explained in the literature is the repeated claim 
that since sample paths of Brownian motion have infinite varia- 
tion, a pathwise approach is not possible; a pathwise approach 
fixes ui and treats ( [29| as a deterministic integral of sample 
paths. Apparently contradicting this claim are publications on 
pathwise approaches [58) ! This is examined below in depth. 



A. Non-absolutely Convergent Series 

The definition of the Lebesgue-Stieltjes integral J f{t)dg{t) 
explicitly requires the integrator g{t) to have finite variation. 
This immediately disqualifies (|29]) as a Lebesgue-Stieltjes in- 
tegral whenever the sample paths of Yt have infinite variation. 
This is not a superfluous technicality; for integration theory 
to be useful, integrals and limits must interact nicely. At the 
very least, the bounded convergence theorem should hold: 



f^''\t)dg{t)^ / f[t)dg{t) (30) 



whenever the f^'^^t) are uniformly bounded. 

Let g{t) be a step function with g{0) = and transitions 
9(4) - 9{tn) = (-1)"+'^ at times < h < t2 < ■ ■ ■ < 1. 
Then 5(1) = lim^^^ E;^^i(-1)"+i i = ln2. Although 
declaring dg to equal 5(1) — 5(0) = In 2 may seem 
the obvious choice, it or any other choice would lead to a 
failure of the bounded convergence theorem: the f^^\t) in 
(30 1 can be chosen so that the j f^'^\t) dg{t) are partial 
sums of X]n(~-'-)"^^ n summands rearranged, and since 
^^(— l)"'+^i is not absolutely convergent, it can be made 
to equal any real number To achieve the limit 5, start with 



until 5 is exceeded, then add — — | — • • • until 



the partial sum drops below 5, and so forth. Other orderings 
can cause the partial sums to oscillate, preventing convergence. 

To summarise, if g{t) has infinite variation then the bounded 
convergence theorem will fail because "order matters". 



Let Yt{'jj) be a random step function with Y{){uj) = 
and each increment Y^+ — Y^- = having independent 
and equal chance of being positive or negative. Every sample 
path of Yt has infinite variation. Given the signs of each 
term in the sum 'Ylm ^ ~ become known, and based on this 
knowledge, sequences can be constructed that invalidate the 
bounded convergence theorem. 

In applications, the order is reversed: a uniformly bounded 
sequence X^^^^ is chosen with limit Xt, and of relevance is 
whether the bounded convergence theorem limt J X^^^ dYt = 
J Xt dYt holds for most if not all oj. Although a sequence can 
be chosen to cause trouble for a single outcome w, perhaps it is 
not possible for a sequence to cause trouble simultaneously for 
a significant portion of outcomes. If Xt can depend arbitrarily 
on UJ then trouble is easily caused for all uj, hence Ito's 
requirement for Xt to be predictable. To avoid this distraction. 



.(fe) 

For EJ-1)"+" 



assume the X^ are deterministic. 



- to converge to 5, or even to oscillate, 
requires long runs of positive signs followed by long runs of 
negative signs. Since the signs are chosen randomly, rearrang- 
ing summands may not be as disruptive as in the deterministic 
case. In fact, rearranging summands has no effect on the value 
of the sum for almost all outcomes lu. 

Technicality: Define the random variables An = Sjy = 

J2n=i and T/v = Y.n=i ^p(n) ^^cre p is a permutation 
of the natural numbers. Put simply, the Sn are the partial 
sums of V„±- in the obvious order, and the Tw are the 
partial sums in some other order specified by p. The Sn form 
an £^ -martingale sequence that converges almost surely and 
in £^ to a finite random variable Sao', see |j9j Chapter 12]. 
Similarly, Tjv converges to a finite random variable Too- With 
a bit more effort, and appealing to the Backwards Convergence 
Theorem [38] , it can be shown that 5*00 — Too almost surely. 

Since rearranging summands (almost surely) does not affect 
the random sum V„ ±- — and the same holds for the 
weighted sums X^n ^n^^^'^H^™) — the stochastic integral 

dYt{uj) can be defined to be the random variable Yi{(jj) — 
Yq(lS) without fear of the bounded convergence theorem 
failing (except on a set of measure zero). Here, Y\[lij) — Yq(lS) 
comes from the particular ordering limAr_j.oo ^Yl!l=\ ^t+ ~^t^- 

That Lebesgue-Stieltjes theory cannot define dYt{uj) as 
Yi(w) — YqIlo) does not make this pathwise definition incor- 
rect. The Lebesgue-Stieltjes theory requires absolute conver- 
gence of sums, meaning all arrangements have the same sum. 
The theory of stochastic integration weakens this to requiring 
any two arrangements have the same sum almost surely (or 
converge in some other probabilistic sense). The difference 
arises because there are an infinite number of arrangements 
of summands and an infinite number of outcomes lu. If uj is 
chosen first then two arrangements can be found whose sums 
disagree, yet if two arrangements are chosen first then their 
sums will agree for almost all lo. 

B. Brownian Motion as an Integrator 



The discussion in Section VI-A remains insightful when 
Yt is Brownian motion. The basic hope is that ( [29] l can be 
evaluated by choosing a sequence of partitions {0 — t\ 



(fe) 
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(k) (k) 

t\ ' < ■ ■ ■ < t]y(^) =1} with mesh size sup 
going to zero as A; — > oo, and evaluating 



.(fc) _ Ak), 



Ar<'='-1 



lim 

fc— ^CJO ^ — ^ 

n=0 



X 



(31) 



This can be interpreted as approximating Xt by a sequence of 
step functions, computing the integral with respect to each of 
these step functions, and taking limits. In particular, if different 
partitions lead to different limits, the dominated convergence 
theorem cannot hold. 



A standard manipulation pOl p. 30] shows ( 3 1 1 simplifies to 

2 ^n=Q 



\Bl — ^ Z]n=o ^ {B^(k) — B^(k) Y for the particular example 

BtdBt- If Bt is Brownian motion then the situation is 
somewhat delicate because the sample paths are only H older 
continuous of orders strictly less than i; section IV-A This 

12] 

(32) 



manifests itself in the existence of a partition such that 



lim sup 



-B (fc) 



B^, 



almost surely, whereas the limit in probability is 



lim 

fc— >-oo 



E ( 

n=0 



B Ak) 



= 1, 



(33) 



with ( [33] l holding for every partition. Therefore, it appears 
stochastic integrals cannot be defined almost surely, but at 



best in probability. This is why, in ( 19 1, the convergence is in 



probability. Section VI-D discusses the difference 



It is claimed in |58| that almost sure convergence is achiev- 
able; how is this possible without breaking the dominated 
convergence theorem? Crucially, the claim is that stochastic 
integrals can be evaluated using a particular sequence of parti- 
tions yielding an almost sure limit. The dominated convergence 
theorem still only holds in probability. 



If ( 3 1 1 converges almost surely then it is called a pathwise 
evaluation of the corresponding stochastic integral. 



Technicality: There are two standard ways of ensuring (33 i 



converges almost surely |38, Theorem 1.28]. The first is to 
shrink the size of the partition sufficiently fast to satisfy the 
hypothesis of the Borel-Cantelli lemma. The second is to use 
a nested partition, meaning no points are moved or deleted 
as k increases, only new points added. The same basic idea 



holds for ( 3 1 1 except care is required if the integrand contains 
jumps; see for example |58j Theorem 2]. 



C. Pathwise Solutions of Stochastic Differential Equations 

There are several meanings of pathwise solutions of stochas- 
tic differential equations. Some authors mean merely a strong 
solution. (Strong and weak solutions are explained in |10|.) 
Or it could mean a limiting procedure converging almost 
surely f59\, in accordance with the definition of pathwise 



evaluation of integrals in Section VI-B In the strongest sense, 
it means the solution depends continuously on the driving 
semimartingale |37|, also known as a robust solution. 

Robust solutions are relevant for filtering. Roughly speak- 
ing, integration by parts applied to the Kallianpur-Striebel 



formula leads to a desirable version of the conditional ex- 



pectation |]60|, ||6T|. 

In general, robust solutions will not exist for stochastic 
differential equations with vector-valued outputs. The recent 
theory of rough paths demonstrates that continuity can be 
restored by only requiring the output be continuous with 
respect to not just the input but also integrals of various higher- 
order products of the input f3ri-|l34l, ||62|. 



As explained in Section VI-D not being able to find a robust 
or even a pathwise solution is often inconsequential in practice. 
Technicality: This primer encourages understanding stochastic 
equations by thinking in terms of limits of piecewise linear 
approximations. Piecewise linear interpolation of sample paths 
using nested dyadic partitions are also good sequences to use 
in the theory of rough paths ||63j Section 3.3]. 

D. Almost Sure Convergence and Convergence In Probability 

Let Zk{LL>) be a sequence of real- valued random variables. 
Fixing Ld means Zk{uj) is merely a sequence of real numbers. 
If this sequence converges for almost all a; then Zk converges 
almost surely. A weaker concept is convergence in probability. 
If Z is a random variable and, for all e > 0, limfe_j.oo PrdZfc — 
Z| > e) = then Zk converges to Z in probability. 

If Zfc converges to Z in probability, but not almost surely, 
and uj is fixed, then Z{ijj) cannot necessarily be determined 
just from the sequence Zk{ijj). Computing a limit in probabil- 
ity requires knowing the behaviour of the Zk{io) as uj changes, 
explaining why the term "pathwise" is used to distinguish 
almost sure convergence from convergence in probability. 

This may lead to the worry that a device cannot be built 
whose output is a stochastic integral of its input, e.g., a filter. 
Indeed, only a single sample path is available in the real world! 
However, a device will never compute limZfc(w) (unless a 
closed-form expression is known), but rather, Zj({uj) will be 
used in place of Z{bj), where K is sufficiently large. And 
Zk{oj) can be computed from a single sample path. 

In practice, what often matters is the mean-square error 
E [{Zk - Zf\. Neither almost sure convergence or conver- 
gence in probability implies convergence in mean-square. 
Importantly then, convergence in mean-square can be used 
when defining stochastic integrals |8 |. (Convergence in mean- 
square implies convergence in probability. Convergence in 
probability only implies convergence in mean-square if the 
family of random variables \Z}S^ is uniformly integrable.) 
Note: A good understanding is easily gained by recalling first 
the representative example of a random variable converging in 
probability but not almost surely |64| and then studying how 



a bad partition satisfying (32i is constructed in |12, p. 365]. 
Technicality: Proofs in probability theory look different from 
proofs in functional analysis because convergence in probabil- 
ity is a non-locally convex topology. Also, there is no topology 
corresponding to almost sure convergence 1 ,65) . 

E. Integrals as Linear Operators 

Ordinary integrals are nothing more than (positive) linear 
functionals on a function space | |66| . The situation is no more 
complicated in the stochastic setting. 
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A stochastic integral /: L £ is a linear operator 
sending a process Xt to a random variable Z = I{Xt), more 
commonly denoted Z ~ Xt dBt for some fixed process 
Bt- Here, L is a vector space of real-valued random processes 
and £ a vector space of real-valued random variables. 

To be useful, / should satisfy some kind of a bounded 
convergence theorem. This generally necessitates, among other 
things, restricting the size of L. (In Ito's theory, L is usually 
the class of adapted processes with caglad paths.) 

Given suitable topologies on L and C, one way to define 
/ is to define it first on a dense subset of L then extend by 
continuity: if X^'^^ — >■ Xt then declare I{Xt) = liuik Xt'^K 
(Protter takes this approach, albeit when / maps processes 
to processes [7|.) This only works if the topology on L is 
sufficiently strong that two different sequences converging to 
Xt assign the same value to I{Xt). Alternatively, restrictions 
can be placed on the allowable approximating sequences; 
the value of I{Xt) is evaluated as the limit of I{x[^'') but 
where there are specific guidelines on the construction of the 
sequence Xt Xt-ln both cases, the topology on C is also 
important; too strong and I{x[^^) need not converge. 

If convergence in probability is used on C then there is the 
perennial caveat that the limit is only unique modulo sets of 
measure zero. Although this means / is not defined uniquely, 
any given version of / will still be a map taking, for fixed lo, 
the sample path 1 1-)- Xt{Lo) to the number Z{uj) — I{Xt){uj). 
In this respect, using the term pathwise to exclude convergence 
in probability is misleading. While a version of I may not be 
constructable one path at a time in isolation, once a version 
of / is given, it can be applied to a single path in isolation. 
(The same phenomenon occurs for conditional expectation.) 



VII. Stratonovich Equations on Manifolds 

Starting from this section, the level of mathematical detail 
increases, leading to the study, in Section |Xj of an estimation 
problem on compact Lie groups. 

Let M C M" be a d-dimensional manifold. Denote by 
Bt Brownian motion in W'- with B] its ith element. A class 
of processes on M can be generated by injecting Brownian 
motion scaled by a function h. Precisely, the Stratonovich 
stochastic differential equation 



Xt 



t d 

E 



h,{t,Xt)odBl 



(34) 



defines a process Xt that, if started on M, will remain on M 
if the hi{t,Xt) always lie in TxtM. In such cases, for any 
smooth function /: M ^ M, the projected process f{Xt) on 
]R satisfies 



f{Xt) = 



t d 

E 



df{h,{t,Xt))odBl 



(35) 



where df{vp) is the directional derivative of / at p in the 
direction Vp E TpM. (If / is locally the restriction of a 
function F: U C M" — > M defined on an open subset of 
M" containing p then df{vp) is the usual directional derivative 
of F at p in the direction Vp.) 



Technicality: If M was not embedded in K", so tha t p4l ) was 
not a stochastic equation in Euclidean space, then \i5\ could 
be used to define (34i: Xt is the semimartingale on M for 



which ( 35 I holds for every smooth function /. 



VIII. Compact Lie Groups 



Lie groups |67|, | [68) are simultaneously a manifold and a 
group. The group operations of multiplication and inversion 
are smooth functions, thereby linking the two structures. 
Extending an algorithm from Euclidean space to manifolds 
is often facilitated by first extending it to compact Lie groups 
where the extra structure helps guide the extension. 

The simplest example of a compact Lie group is the circle 
with group multiplication (cos 0, sin 0) • (cos sin(/)) = 
(cos(6'+0),sin(6'+0)) dxvd identity element (1,0) S S*^ C M?. 
Being a subset of Euclidean space, the circle is seen to be 
compact by being closed and bounded (Heine-Borel theorem). 
Technicality: Compactness is a topological property. In a 
sense, compact sets are the next simplest sets to work with 
beyond sets containing only a finite number of elements. 
Compact sets are sequentially compact: any sequence in a 
compact set has a convergent subsequence. 

A more interesting example of a compact Lie group is 
the special orthogonal group SO{n) defined as the set of all 
nxn real-valued orthogonal matrices with determinant equal 
to one, with matrix multiplication as group multiplication. 
This implies the identity matrix is the identity element. By 
treating the space of n x n real-valued matrices as Euclidean 
space — the Euclidean inner product iQ {A, B) = Tt{B^ A} 
where Tr denotes the trace of a matrix and superscript T 
is matrix transpose — it can be asked if SO{n) meets 
the requirements in Section II-D of being a manifold (of 



dimension ^(n— 1)), to which the answer is in the affirmative. 
Furthermore, by considering elementwise operations, it can be 
seen that group multiplication and inversion, which are just 
matrix multiplication and inversion, are smooth operations. 

The group 5*0(3) has been studied extensively in the 
physics and engineering literature [69]. Studying 5*0(3) is the 
same as studying rotations of three-dimensional space. 

A. The Matrix Exponential Map 

Choose a matrix X e E"^" satisfying X'^ X = I and 
detX = 1. In other words, choose an element X of SO{n). 
Since SO{n) is a group, it is closed under multiplication, 
hence the sequence X^^X^,X^,X^ , ■ ■ ■ lies in 50(n), form- 
ing a trajectory. The closer X is to the identity matrix, the 
closer successive points in the trajectory become. Note that 
X is a rotation of M", hence X" is nothing more than the 
rotation X applied n times. 

The idea of interpolating such a trajectory leads to the curve 
7(t) = exp(iA) where exp is the matrix exponential | |70) . 
Indeed, if A € M"""" is such that Gxp(A) = X then X^= 
exp(2A), X'^ = exp(3A), and so forth. The set of A for 

'The vcc operator identifies the space R"*^" with the space R" by 
stacking the columns of a matrix on top of each other to form a vector. 

2 

Under this identification, the Euclidean inner product on R" becomes the 
inner product {A, B) = (vcc _B)^(vec ^) = Tr{_B^A}. 
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VIII-A 



which exp(^) is an element of 5*0(71) forms a vector space; 
this nontrivial fact relates to SO{n) being a Lie group. Let 
so{n) denote this set; it is the Lie algebra of SO{n). 

Since exp(A) will lie in SO{n) if exp{tA) lies in SO{n) 
for an arbitrarily small i > 0, it suffices to determine 
so{n) by examining the linearisation of exp. The constraint 

exp(tA)^ exp(i^) = / implies {I+tA-\ )^(/+tylH ) = 

I from which it follows that so{n) ^ {A e M"""" | A + A'^ = 
0}. This shows also that so{n) is the tangent space of SO{n) 
at the identity: so{n) = TiSO{n). 

B. Geodesies on SO{n) 

Recall from earlier that the Euclidean inner product on 
K"^" is {A,B) = Ty{B^A}. a geodesic on SO{n) is a 
curve with zero acceleration, where acceleration is measured 
by calculating the usual acceleration in ]R"^" then projecting 
orthogonally onto the tangent space. 

It may be guessed that X'^ , , , • • • in Section 
lies on a geodesic, consequentially, j{t) = exp{tA) should be 
a geodesic whenever A e so{n). The acceleration in Euclidean 
space is 'j"{t) = cxp{tA)A^ . If this is orthogonal to the 
tangent space of SO{n) at j{t) then 7 is a geodesic. 

Fix t and let Z = '^{t) = cxp{tA). The tangent space at 
Z g SO{n) is the vector space Z so{n), that is, matrices 
of the form ZC with C £ so{n). This shows one of the 
many conveniences of working with manifolds that are also 
groups; group multiplication moves the tangent space at the 
identity element to any other point on the manifold. Then 
{-f"{t),ZC) = {ZA^, ZC) = (A^ C) = whenever A and C 
are skew-symmetric matrices, proving 7 is indeed a geodesic. 

A geodesic is completely determined by its initial position 
and velocity, hence every geodesic starting at the identity 
element is of the form t exp(tA) for some A £ so{n). 

All geodesies starting at X £ SO{n) are of the form t 1-^ 
X exp{tA), that is, left multiplication by X sends a geodesic 
to a geodesic. Right multiplication also sends geodesies to 
geodesies. (This implies that for any B £ so{n) and X e 
SO{n), there must exist an A G so{n) such that X exp(tA) = 
exp{tB)X; indeed, A = X^BX.) 

If 7(t) = Xexp[tA) then 7'(0) = XA. Therefore, the 
Riemannian exponential map is given by 

¥.xpx{XA) = Xexp(A), X e SO{n), A e so{n). (36) 

C. Why the Euclidean Norm? 

The matrix exponential map relates only to the group struc- 
ture; t i-> exp(tj4) generates a one-parameter subgroup. The 
Riemannian exponential map relates only to the Riemannian 
geometry. For there to be a relationship between cxp and Exp 
requires a careful choice of Riemannian metric. 

The special relationship between SO{n) C M"^" and the 
Euclidean inner product on M"^" is that the inner product 
is bi-invai-iant, meaning {A,B) = {XA,XB) = {AX.BX) 
for any A, B E so{n) and X E SO{n). In words, the inner 
product between two velocity vectors at the identity element 
of the Lie group is equal to the inner product between the 
same two velocity vectors should they be "shifted" to another 



point on the Lie group, either by left multiplication or by right 
multiplication. This is why both left and right multiplication 
preserve geodesies. (Note; Since a tangent vector is the ve- 
locity vector of a curve, the terms tangent vector and velocity 
vector are used interchangeably.) 

Technicalities: On a compact Lie group, a bi-invariant metric 
can always be found. Different bi-invariant metrics lead to the 
same geodesies. The most familiar setting to understand this 
in is Euclidean space; changing the inner product alters the 
distance between two points, but the shortest curve is still the 
same straight line. Similarly, on a product of circles, which is 
a compact Lie group, a different scaling factor can be applied 
to each circle to obtain a different bi-invariant metric, but the 
geodesies remain the same. 

D. Coloured Brownian Motion on SO{n) 

On an arbitrary Riemannian manifold, there is essentially 
only one kind of Brownian motion. On Lie groups (and other 
parallelisable manifolds), the extra structure permits defining 
coloured Brownian motion having larger fluctuations in some 
directions than others. 

If Gt is a process on SO{n), it has both left and right in- 
crements, namely, Gj^Gs+t and Gs+tGj-^ respectively. Non- 
commutativity implies these increments need not be equal. 
This leads to distinguishing left-invariant Brownian motion 
from right-invariant Brownian motion. As one is nevertheless 
a mirror image of the other — if Gt is left-invariant Brownian 
motion then G^^ is right-invariant Brownian motion — it 
generally suffices to focus on just one. 

A process Gt on SO{n) is a left-invariant Brownian motion 
if it has continuous sample paths and left increments that are 
independent of the past and stationary. (See |3j V.35.2] for de- 
tails.) If it were stochastically anti-developed then a coloured 
Brownian motion plus possibly a drift term would result. The 
drift term is a consequence of the non-commutativity. 

An algorithm for simulating left-invariant Brownian motion 
on SO{n) is now derived heuristically. Unlike in Section [ll] 
stochastic development is not used because left-invariant 
Brownian motion relates to the group structure, not the geo- 
metric structure. Nevertheless, the basic principle is the same: 
replace straight lines by one-parameter subgroups. The only 
issue is how to specify a random velocity in a consistent way 
to achieve stationarity. The velocity vector of X cxp{tA) at X 
is XA. Since a left-invariant process is sought, the velocity 
vector XA at X should be thought of as equivalent to the 
velocity vector A at the identity (that is, left multiplication 
is used to map TiSO{n) onto TxSO{n)). This leads to the 
following algorithm. 

Choose a basis Ai,--- ,Ad for so{n) and let fi^.i be a 
doubly-indexed sequence of iid N{0, 1) Gaussian random 
variables. Then a left-invariant Brownian motion starting at 
Wo G SO{n) is approximated by 



W(^k+i)5 = Wks exp VS y^^/3k,iAi 



=1 



(37) 



for k = 0, 1,2,---, where 5 > is a small step size. If 
necessary, these discrete points can be connected by geodesies 
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to form a continuous sample path. That is, for an arbitrary t 
lying between k6 and (fc + 1)6, 



Wf is coloured Brownian motion with covariance C started 



Wt = Wks exp 



t-kS 



(38) 



The left increments W^lW(k+i)s are stationary and inde- 
pendent of the past values {Wt \ t < kS}, as required of 
left-invariant Brownian motion. 

An outstanding issue is the effect of changing the ba- 
sis Ai. Comparing ( [37] i with the corresponding formula for 
(white) Brownian motion in Section namely, W(^k+i)s — 
Zk ) where each Zk is an "iV(0, /)" Gaussian 
random variable on the tangent space TwksSO{n), shows 



that ( 37 1 produces white Brownian motion if the Ai are an 
orthonormal basis, as to be expected. 



Changing basis vectors in ( 37 1 is equivalent to changing 
the colour of the driving Gaussian noise pk.i- Therefore, 
coloured Brownian motion with covariance C is defined as 
(the Umit as (5 ^ of) the process (l37|-(l38]l where {Ai} 



is a given orthonormal basis and the iid random vectors 
Pk = (/3fc,i, • • • ,h,d) have distribution A^(0, C). 

E. Formal Construction of Brownian Motion 



Taking limits in ( 37 1-( 38 i leads to a Stratonovich equation 
for left-invariant Brownian motion on a Lie group G of 
dimension d. Let , • • • , be a basis for the Lie algebra. 
Denote by hi the extension of Ai to a left-invariant vector 
field on G; for G = SO[n), this is simply hi{X) = XA,. Let 
Bt be Brownian motion on M.'^. Then the solution Wt of 



Wt = 



t d 

E 



h,{Wt) o dBl 



(39) 



is a left-invariant Brownian motion on G. 

As in Section |VIII-D| if the Ai are an orthonormal basis 
and the Bt is coloured Brownian motion with E [5^5^] — tC 
then Wt is coloured Brownian motion with covariance G. 
Technicality: Whether Bt is white Brownian motion and the 
Ai changed, or the Ai are orthornormal and the colour of Bt 
changed, is a matter of choice. 

IX. Brownian Distributions 

There is a two-way relationship between Gaussian random 
variables and Brownian motion in M. Brownian motion can 
be defined in terms of Gaussian distributions. Conversely, a 
Gaussian random variable can be generated by sampling from 
Brownian motion Bt: the distribution of Bi is N{0, 1). 

Lack of linearity prevents defining Gaussian random vari- 
ables on manifolds. Nevertheless, sampling from Brownian 
motion on a manifold produces a random variable that, if 
anything is, is the counterpart of a Gaussian random variable. 
Such a random variable is said to have a Brownian distribution. 

Formally, given a Lie group G of dimension d, an element 
g ^ G and a positive semidefinite symmetric matrix G G 
j^dxd^ a random variable X with (left) Brownian distribution 
N{g, G) has, by definition, the same distribution as Wi, where 



at Wo = g; see Section |VIII-E| That is, coloured Brownian 
motion started at g is run for one unit of time and where it 
ends up is a realisation of N{g, G). 

X. Parameter Estimation 

Given a compact Lie group G, let j/i, 1/2, ■ ■ • be iid N{g, G) 
distributed, as defined in Section |IX] The aim is to estimate 
g and G from the observations y^. In |7|, it was realised that 
by embedding G in a matrix space M"^", an estimate of g 
and (sometimes) G is easily obtained from the average of the 
images Yi £ M"^" of the yi £ G under the embedding. This 
section re-derives some of the results at a leisurely pace. For 
full derivations and consistency proofs, see \JJ. 

A. Estimation on SO{2) 

Although SO (2) is nothing other than the unit circle S*^ in 
disguise, the theory below will be required later 

If X is a 2 X 2 orthogonal matrix then its first column 
corresponds to a point on the unit circle because the sum 
of the squares of the two elements is one, and its second 
column, having unit norm and being orthogonal to the first 
column, has only two possibilities, corresponding to whether 
the determinant of X is 1 or —1. If X e S'0(2), its second 
column is therefore fully determined given its first column 
because the determinant of X must be 1. (The relationship is 
deeper: S0{2) is Lie group isomorphic to S^.) 

Let y e SO{2) be randomly drawn from the distribution 
N{g,a'^) where the covariance matrix G has been replaced 
by the scalar cr^ because the dimension of 5*0(2) is 1. Since 
50(2) C M^^^, y can be treated equally well as a random 
matrix, and treating it as such, it is of interest to determine its 
expected value, which in general will not lie in 50(2). 



Let V be the (infinitesimal) generator 1 10 Sec. 7.3] of Wt, 
that is, for any smooth "test function" / : G — ?> M, 



2?/(.9) 



lim 



[fiWt)] - fig) 



(40) 



where Wt is Brownian motion starting at g with covariance 
matrix G; the superscript g on the expectation symbol signifies 



this dependence on g. It follows from (39 1 and (35 1 in a well- 
known way that 



d 



(41) 



where A^ is the left-invariant vector field generated by A. 
Vector fields act on functions by directional differentiation; 

{A^f){g) = il^^figexpiAt)). Therefore, if / is linear. 



d 

^fia) = ^ E C\,figAA,). 



(42) 



This will be derived from first principles for 50(2). Let 



A = 



-1 

1 



(43) 
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be the chosen orthonormal basis for so(2) with respect to the 
scaled EucHdean inner product {X,Y) = {1/^2) Tr{y^X}. 
The scaling term is just a matter of convenience (and con- 
vention). Since A is skew-symmetric and hence normal, it is 
diagonalisable: Q~^AQ — D where D = diag{— j} and 



Q = \/2 



(44) 



As ^ I 0, (37 1 can be used to approximate Wt, yielding 

/oo 
geM^tpA)p{li)dl3 (45) 
-oo 

where p{P) is the probability density function of N{Q, a^). By 
writing exp(v^/3A) — Q cxp{^/i(5D) Q^^ and remembering 
the characteristic function of the distribution A^(0, cr^), it 
follows that E3 [Wt] ~ exp(-^) g. Therefore, 



lim«to^(-^ 
tio t 



15- 



(46) 



If / is the restriction of a linear function on M^^^ then it 
can be interchanged with all the operations in (|40]l, hence 
T>f{g) — /(— ^ 5) = /(.9)- Fortunately, this agrees with 
(42 1 because = — /, validating the approximation (45 1. 



Time-homogeneity of (39 1 means Vf(Wt) is the instanta- 
neous rate of change of E for all t, not just t = 0. 
Roughly speaking, Kolmogorov's backward equation JTOj Sec. 
8.1] evaluates E [/(ll^t)] by integrating these infinitesimal 
changes. Since the changes are path dependent, a PDE rather 
than an ODE is required: u{t,g) — E^ [/(T^t)] satisfies 
3u 

— =2?u, u{0,g)^f{g) (47) 

where V acts on the function g h- > u{t, g). Although / should 
map to M, letting it be the identity map on M^^^ does no harm, 
in which case the solution to (47 1, now a PDE on M^^^, is 

(48) 



W [Wt]^u{t,g)^eM~^)9- 



It just so happens that this true solution agrees with the earlier 
approximation for E^ \Wt\. In general this will not be the case 
and solving the PDE will be necessary. 
Technicality: Kolmogorov's backward equation is a PDE for 
evaluating u{t,g) = E [/(Wr) \ ^t^g\ for a fixed T by 
integrating backwards from time i = T. By time homogeneity, 
u(t,g) = E[/(iyT-t) I T^o = g\ = u{T ~ t,g), explaining 
why (|47]l integrates forwards from time T — T = 0. 

In summary, it has been derived from first principles that 



(49) 



Given yi, • • • ,ym G 5*0(2), the method of moments dic- 
tates equating the sample mean with the true mean, namely, 
estimating g and cr^ by solving cxp{—^)g — y where 
y ^ m Si!li Vi is the (extrinsic) sample average obtained by 
treating the as 2 x 2 matrices. As y will not necessarily be 
of the form exp(— ^) g, the equation should be solved in the 
least-squares sense, facilitated by the polar decomposition of 
y. If y = UP where U is orthogonal and P positive-definite 
symmetric then g can be estimated by U and exp(— ^) can 
be estimated by ^ Tr{P}. Equivalently, g can be estimated by 
the "Q" matrix in the QR decomposition of y. 



B. Estimation on 50(3) 

The same steps are followed as in Section X-A Choose the 
basis Ai, A2, A3 to be, respectively. 



(50) 
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If / is linear on the ambient matrix space then, from (42i, 

1 



Vfig)^figZ), Z 



2 5Z '^ij^i^r 



(51) 



As before, a solution to ~ T>u is u{t,g) — gexp{tZ); note 
u is linear in g for all t, hence the validity of using ( [5T[ ) which 
neglects nonlinear terms of /. In summary, if y ^ N{g, C) 
then E [y] — gexp{Z). 

A polar decomposition of the sample average allows the 
estimation of g and exp(Z) because exp(Z) is a positive- 
definite symmetric matrix, a consequence of Z being sym- 
metric (which in turn is due to C being symmetric and the Ai 
anti-symmetric). The only remaining question is if C can be 
recovered from Z. Direct expansion shows that for 50(3), 



Z = -- 



1 



Oil + O22 
O32 
—Oai 



O23 
Oil + O33 
O21 



— Oi3 

012 
022 + 033 



(52) 



Therefore, for 50(3), the parameters g and O can be estimated 
from a polar decomposition of the sample average. 

C. Estimation on SO{n) for n > 3 

The above ideas will not allow all elements of O to be 
estimated in higher dimensional cases for the simple reason 



that the dimension of 50(n) is d 



r(n — 1), hence 



N{g, C) disti-ibution on 50(n) requires d+^{d+l) = f (n- 
l)(n^ — n + 6) real-valued parameters for its specification, 
which would exceed n^, the dimension of the ambient space 
jjnxn ^ SO{n), if n were larger than three. 

Nevertheless, the calculations in the previous section remain 
valid for n > 3. The expected value of y ^ N{g, O) is 
gcxp(Z) where Z is given by (51 1. In particular, g can be 
estimated as before, either via a polar decomposition or as the 
"Q" matrix in the QR decomposition of the sample average. 
Furthermore, if C is known to have a particular structure, the 
simplest example being if O is diagonal, then it may still be 
possible to recover C from Z. 

D. Estimation on a Compact Lie Group 

The above ideas extend directly to SU{n) and U{n), the 
real (not complex!) Lie groups consisting of complex-valued 
unitary matrices, the former with the additional requirement 
that the determinant be one. See JT) for details. 

Let G be an arbitrary compact Lie group. Group represen- 
tation theory studies (among other things) smooth maps of the 
form / : G — > U{n) which preserve group operations, meaning 
fig-') = [fig)]-' and f{gh) = f{g)f{h). Such maps ai-e 
called (unitary) representations. 

Let / be a representation of G and y ^ N{g, O), the latter 
with respect to the orthonormal basis Ai, - ■ ■ ,Ad for the Lie 
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algebra of G. The same steps as before will be used to calcu- 
late [/(y)]- Although / is not linear, it has another simplify- 
ing feature instead; it preserves group operations. In particular, 
A^f{g) = ifige'^') = = f{9){A^f) where 

the intermediate derivatives are to be evaluated at t = and 
Aif denotes the derivative of / at the identity element in the 
direction Ai. Note that Aif will be an element of the Lie 
algebra of U{n), that is, a skew-Hermitian matrix. Therefore, 

Vf{9)^f{g)Zf, Z/ = i^C.,(AJ)(A,/) (53) 

where is a symmetric matrix. It is readily verified that 
u{t,g) = f{g)exp{tZf) solves ^ — Vu. (The map 
g 1-^ u{t,g) preserves group multiplication which is the only 
requirement for ( [53] l to be valid.) Thus, 

E[fiy)] = fig)cMZf) (54) 

where Zf is the matrix given in ( [53] l. Regardless of the choice 
of /, Zf will be symmetric and so exp{Zf) will be a positive- 
definite symmetric matrix. 

Once again, a polar decomposition can be used to estimate 
f{g) and exp{Zf). How much of g and C can be recovered 
depends on the mappings g i-> f{g) and C t-^ Zf. 

There is always (at least) one representation f : G U{n) 
that can be written down immediately for a compact Lie group 
equipped with a bi-invariant metric, and that is the adjoint 
representation. Recall that the adjoint representation Ad : G — > 
Aut(g) maps a group element g e G to an automorphism 
of the corresponding Lie algebra g. Given an orthonormal 
basis for g, any automorphism of g can be represented in 
matrix form with respect to this basis. A consequence of the 
basis being orthonormal and the metric being bi-invariant is 
that these matrices will be unitary. That is, by choosing an 
orthonormal basis, the adjoint representation can be written in 
the form f : G ^ U (n) where n is the dimension of G. This 
allows the application of the above extrinsic averaging method 
to arbitrary compact Lie groups. 

Furthermore, the adjoint representation (or any other rep- 
resentation) can be used to augment any information already 
available about the parameters g and G. That is to say, given 
iid samples j/i,-- - ,ym ^ N{g,G), in addition to forming 
y ^ — J2i Vi (assuming the yi belong to an appropriate Lie 
group such as 5*0(71)), the average — f{Vi) can also be 
formed. Its polar decomposition would yield an estimate of 
f{g) and of Zf, thereby supplementing existing information 
about g and C. This can be repeated for any number of 
representations /i , /2 , ■ ' ' ■ 

XI. A Few Words on Manifold-Valued Estimators 

Estimation theory extends to manifolds ||7T)-||73|. Several 
concepts normally taken for granted, such as unbiasedness of 
an estimator, are not geometric concepts and hence raise the 
question of their correct generalisations to manifolds. 

The answer is that the difficulty lies not with manifolds, 
but with the absence of meaning to ask for an estimate of 
a parameter. The author believes firmly that asking for an 
estimate of a parameter is, a priori, a meaningless question. 



It has been given meaning by force of habit. An estimate only 
becomes useful once it is used to make a decision, serving 
as a proxy for the unknown true parameter value. Decisions 
include: the action taken by a pilot in response to estimates 
from the flight computer; an automated control action in re- 
sponse to feedback; and, what someone decides they hear over 
a mobile phone (with the pertinent question being whether the 
estimate produced by the phone of the transmitted message is 
intelligible). Without knowing the decision to be made, whether 
an estimator is good or bad is unanswerable. One could hope 
for an estimator that works well for a large class of decisions, 
and the author sees this as the context of estimation theory. 

The definition of unbiasedness in | |7l| with respect to a loss 
function accords with this principle of application dependence. 
Otherwise, two common but ad hoc definitions of the mean 
are the extrinsic mean and the Karcher mean. The extrinsic 
mean of a random variable a; S Af C M" simply forgets 
the manifold and takes the mean of x treated as an M"- 
valued random variable. The extrinsic mean generally wifl 
not lie on M. The Karcher mean ||74), | |75| looks to be an 
ideal generalisation of the mean, but on close inspection, 
problems arise if the manifold has positive curvature and 
the samples are sufficiently far apart to prevent the Karcher 
mean from existing or being unique. (Since manifolds are 
locally Euclidean, asymptotic properties of estimators are more 
readily generalised to manifolds.) 

In summary, the author advocates working backwards, start- 
ing from a (representative) decision problem and deriving a 
suitable estimator for making good decisions. This viewpoint 
should clarify otherwise arbitrary choices that must be made 
along the way. 

Technicality: The James-Stein estimator helps demonstrate the 
definition of a good estimator is application dependent | |76) : 
remarkably, in terms of minimising the mean-square error, the 
best estimate of /i G given X ^ N{fi, I) is not 'p, = X. 
Since usually (but not always) the "best" estimator to use in 
practice really is = X, this illustrates a shortcoming of the 
default assumption that mean-square error is always a good 
measure of performance. 

XII. Conclusion 

Pertinent background not easily accessible in the literature 
on how to understand and work with stochastic processes 
on manifolds was presented. Stochastic development was 
introduced to ease the transition from Euclidean space to 
manifolds: stochastic development maps each sample path in 
Euclidean space to a sample path on a manifold that preserves 
the infinitesimal properties of the process. Not surprisingly 
then, Brownian motion on a manifold is the stochastic devel- 
opment of Brownian motion in Euclidean space. It was also 
explained that stochastic development, Stratonovich integrals 
and Stratonovich differential equations could all be understood 
as limits of piecewise linear approximations in Euclidean 
space and, more generally, as limits of piecewise geodesic 
approximations on manifolds. 

Lie groups are manifolds with a compatible group structure. 
The group structure permits defining "coloured" Brownian mo- 
tion on Lie groups. Sampling from coloured Brownian motion 
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produces random variables with Brownian distributions that 
are generalisations of Gaussian random variables. Formulae 
from I?] were re-derived for estimating the parameters of 
Brownian distributed random variables. The derivation demon- 
strated that standard techniques, such as using Kolmogorov's 
backward equation, remain applicable in the more general 
setting of processes on manifolds. 
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